Introduction to Triton Programming: Beyond Elementwise Operations: Moving to Tile-Based Matrix Operations

In the previous lesson, we focused on elementwise operations (such as a basic ReLU on a matrix). These operations are memory-bound: the GPU spends more time moving data between HBM and registers than doing arithmetic.
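
To make "memory-bound" concrete, here is a rough back-of-the-envelope sketch in plain Python. The helper name `elementwise_intensity` is hypothetical, and the model assumes 4-byte (float32) elements, one load and one store per element, and one arithmetic operation per element (the `max` in ReLU). Real hardware counters will differ.

```python
def elementwise_intensity(num_elements, bytes_per_element=4, flops_per_element=1):
    """Approximate FLOPs per byte for an op that reads and writes each element once."""
    # Assumption: every element is loaded once from HBM and stored once back.
    bytes_moved = 2 * num_elements * bytes_per_element
    flops = num_elements * flops_per_element
    return flops / bytes_moved


# ReLU over a 4096 x 4096 float32 matrix
print(elementwise_intensity(4096 * 4096))  # 0.125 FLOPs per byte
```

Under these assumptions the ratio does not depend on the matrix size, so making the tensor larger never makes an elementwise kernel less memory-bound.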

1. Why GEMM Matters So Much

General Matrix Multiplication (GEMM) on $N \times N$ matrices has $O(N^3)$ computational complexity while requiring only $O(N^2)$ memory accesses. This high ratio of computation to data lets us hide memory latency behind massive arithmetic throughput, making GEMM the "heart" of large language models (LLMs).
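
The scaling argument can be checked with a small sketch. The function name `gemm_intensity` is hypothetical, and the model is idealized: it assumes square float32 matrices, counts $2N^3$ floating-point operations (one multiply and one add per inner-product term), and assumes each of the three matrices crosses the HBM boundary exactly once, which is only achievable with perfect on-chip reuse.

```python
def gemm_intensity(n, bytes_per_element=4):
    """Idealized FLOPs per byte for an n x n by n x n matrix multiplication."""
    flops = 2 * n**3                             # n^2 outputs, n multiply-adds each
    bytes_moved = 3 * n**2 * bytes_per_element   # read A, read B, write C (ideal reuse)
    return flops / bytes_moved


for n in (256, 1024, 4096):
    print(n, round(gemm_intensity(n), 1))
```

In this model the intensity is $N/6$ FLOPs per byte for float32, so it grows linearly with $N$, which is why a well-tiled GEMM can become compute-bound.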

2. 2D Memory Representation

Physical RAM is one-dimensional. To represent a 2D tensor, we use strides: the number of elements to skip in memory to advance one step along each dimension. A common mistake in production is assuming that a tensor is contiguous. If you mix up the row and column strides in your pointer arithmetic, you will read "ghost" data or trigger a memory access violation.
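
The stride arithmetic can be illustrated without a GPU. The sketch below uses a plain Python list as a stand-in for linear memory. The helper `flat_offset` and the tiny 2 x 3 example are hypothetical illustrations, not part of any Triton API.

```python
def flat_offset(i, j, stride_row, stride_col):
    """Linear index of logical element (i, j) given per-dimension strides."""
    return i * stride_row + j * stride_col


# A 2 x 3 row-major matrix stored in a flat buffer.
buffer = [10, 11, 12,
          20, 21, 22]
stride_row, stride_col = 3, 1  # contiguous row-major strides

# A transposed view (3 x 2) shares the same buffer: the strides swap, no data moves.
t_stride_row, t_stride_col = stride_col, stride_row  # (1, 3)

# Correct: element (1, 0) of the transpose is element (0, 1) of the original.
correct = buffer[flat_offset(1, 0, t_stride_row, t_stride_col)]

# Buggy: assuming the 3 x 2 view is contiguous, i.e. strides (2, 1).
wrong = buffer[flat_offset(1, 0, 2, 1)]

print(correct, wrong)  # 11 12
```

The buggy read stays inside the buffer here, so it silently returns the wrong value. With larger indices the same mistake can step past the end of the allocation.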

3. Tile-Based Generalization

Triton extends the elementwise logic by moving from a single pointer to a block of pointers. By using 2D tiles (for example $16 \times 16$), we exploit data reuse in high-speed SRAM, keeping data "hot" for fused operations such as bias addition or activation before writing back to global memory.
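
A block of pointers is built by broadcasting a column of row offsets against a row of column offsets. In a Triton kernel this is commonly written as `offs_m[:, None] * stride_m + offs_n[None, :] * stride_n`, with the offsets produced by `tl.arange`. The sketch below reproduces the same arithmetic in plain Python so it runs without a GPU. The function names, the program IDs, and the 4 x 6 example matrix are all hypothetical.

```python
def tile_offsets(pid_m, pid_n, block_m, block_n, stride_m, stride_n):
    """2D grid of linear offsets for the tile owned by program (pid_m, pid_n)."""
    offs_m = [pid_m * block_m + i for i in range(block_m)]  # row indices of the tile
    offs_n = [pid_n * block_n + j for j in range(block_n)]  # column indices of the tile
    # Broadcast: every row index is combined with every column index.
    return [[m * stride_m + n * stride_n for n in offs_n] for m in offs_m]


def tile_mask(pid_m, pid_n, block_m, block_n, num_rows, num_cols):
    """True where the tile element lies inside the matrix bounds."""
    offs_m = [pid_m * block_m + i for i in range(block_m)]
    offs_n = [pid_n * block_n + j for j in range(block_n)]
    return [[m < num_rows and n < num_cols for n in offs_n] for m in offs_m]


# 4 x 6 row-major matrix (strides 6 and 1), 2 x 2 tiles, tile at grid position (1, 2).
print(tile_offsets(1, 2, 2, 2, stride_m=6, stride_n=1))  # [[16, 17], [22, 23]]

# A 2 x 4 tile at grid position (0, 1) hangs over the right edge of the same matrix.
print(tile_mask(0, 1, 2, 4, num_rows=4, num_cols=6))
```

The mask plays the role that the `mask` argument of `tl.load` and `tl.store` plays in a real kernel: it keeps tiles that overhang the matrix edge from touching memory outside the tensor.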

QUESTION 1

Why is an elementwise ReLU on a large matrix considered 'memory-bound'?

The ReLU function requires complex transcendental math.

The ratio of arithmetic operations to memory loads is very low (1:1).

Matrices are naturally stored in CPU memory only.

Triton cannot process non-linear activations.

QUESTION 2

What is the result of 'The Stride Trap' in production kernels?

The kernel runs significantly faster but with less precision.

Memory access violations or corrupted output due to incorrect address calculation on non-contiguous tensors.

The GPU automatically corrects the indexing using L2 cache.

The tensor is forced into a 1D shape by the compiler.

QUESTION 3

How does Triton represent a 2D tile of pointers?

By using a nested Python list of integers.

By broadcasting a 1D column vector and a 1D row vector of offsets together.

By launching multiple 1D kernels sequentially.

By allocating a special 2D register file.

QUESTION 4

Which operation benefits most from pairing O(N³) computation with only O(N²) memory accesses to hide memory latency?

Vector Addition

Matrix Multiplication (GEMM)

Sigmoid Activation

Global Average Pooling

QUESTION 5

List three kernels in your current workflow that launch multiple PyTorch ops and might benefit from fusion.

Linear -> Bias -> ReLU; LayerNorm -> Dropout; Softmax -> Masking.

Print -> Log -> Sleep.

DataLoader -> Augmentation -> Storage.

These ops cannot be fused in Triton.